Add error to point user to slurm resume log #676
- self._update_failed_nodes(set(nodes_resume_list), "InsufficientInstanceCapacity", override=False)
+ self._update_failed_nodes(
+     set(nodes_resume_list),
+     "InsufficientInstanceCapacity(Check slurm_resume log for ec2 error codes)",
InsufficientInstanceCapacity is a common pcluster error code used in several places. Can we define a constant for it?
The same goes for the sentence "Check ... codes": since we repeat it in many places, it can be a constant too.
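A minimal sketch of what this suggestion could look like. The constant names and the helper below are hypothetical and do not come from the plugin; the hint text follows the wording shown in the Tests section of this PR.

```python
# Hypothetical constant names; the real module and naming in the plugin may differ.
INSUFFICIENT_INSTANCE_CAPACITY = "InsufficientInstanceCapacity"
CHECK_RESUME_LOG_HINT = "Check the slurm_resume log for EC2 error codes"


def describe_failure(error_code: str, hint: str = CHECK_RESUME_LOG_HINT) -> str:
    """Build a human-readable message without modifying the error code itself."""
    return f"{error_code} - {hint}"


if __name__ == "__main__":
    print(describe_failure(INSUFFICIENT_INSTANCE_CAPACITY))
```

With both strings defined once, call sites and unit tests can import the constants instead of repeating the literals.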
- self._update_failed_nodes(set(nodes_resume_list), "InsufficientInstanceCapacity", override=False)
+ self._update_failed_nodes(
+     set(nodes_resume_list),
+     "InsufficientInstanceCapacity(Check slurm_resume log for ec2 error codes)",
[BLOCKING] I agree with making the log line more helpful by redirecting the user to the right log. However, doing it this way actually changes the error code from InsufficientInstanceCapacity to InsufficientInstanceCapacity(Check...codes), which can ultimately affect how we monitor ICE errors on the cluster dashboard.
See
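To illustrate the concern in this comment, here is a rough sketch of keeping the error code and the user-facing hint separate. The reason layout mirrors the slurmctld line in the Tests section of this PR; the function names, the regular expression, and the idea that a consumer extracts the code by pattern matching are assumptions made for illustration only, not the dashboard's actual logic.

```python
import re

ERROR_CODE = "InsufficientInstanceCapacity"
HINT = "Check the slurm_resume log for EC2 error codes"

# Assumed pattern: the code sits inside a leading "(Code:...)" group.
_CODE_PATTERN = re.compile(r"\(Code:(?P<code>[A-Za-z0-9_.]+)\)")


def build_node_reason(error_code: str, hint: str) -> str:
    """Compose a node reason where the hint follows the code instead of being part of it."""
    return f"(Code:{error_code})Failure when resuming nodes - {hint}"


def extract_error_code(reason: str):
    """Return the error code embedded in a node reason, or None if absent."""
    match = _CODE_PATTERN.search(reason)
    return match.group("code") if match else None


if __name__ == "__main__":
    reason = build_node_reason(ERROR_CODE, HINT)
    # The extracted code is still exactly the original error code.
    print(extract_error_code(reason) == ERROR_CODE)
    # Appending the hint to the code itself produces a different string.
    print(f"{ERROR_CODE}({HINT})" == ERROR_CODE)
```

Under these assumptions, anything that matches on the exact code keeps working, because the hint lives in the message portion rather than in the code.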
  log.info(
      "The following compute resources are in down state due to insufficient capacity: %s, "
-     "compute resources will be reset after insufficient capacity timeout (%s seconds) expired",
+     "compute resources will be reset after insufficient capacity timeout (%s seconds) expired. "
[Test] Can we reflect this change in the corresponding unit test, as you did for the resume script?
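A sketch of how such a test could assert on the new wording. The logger name and `report_insufficient_capacity` are stand-ins invented for this example, not the real clustermgtd code, and `unittest.assertLogs` is used only to keep the sketch self-contained; the project's own tests may use different fixtures.

```python
import logging
import unittest

# Stand-in logger; the real module logger in clustermgtd is not reproduced here.
log = logging.getLogger("clustermgtd_sketch")


def report_insufficient_capacity(compute_resources, timeout_seconds):
    """Hypothetical stand-in that emits the message format shown in the diff above."""
    log.info(
        "The following compute resources are in down state due to insufficient capacity: %s, "
        "compute resources will be reset after insufficient capacity timeout (%s seconds) expired. "
        "Check the slurm_resume log for EC2 error codes.",
        compute_resources,
        timeout_seconds,
    )


class TestInsufficientCapacityMessage(unittest.TestCase):
    def test_message_points_to_slurm_resume_log(self):
        # Capture INFO records emitted on the stand-in logger.
        with self.assertLogs(log, level="INFO") as captured:
            report_insufficient_capacity({"compute": ["cit"]}, 600.0)
        self.assertEqual(len(captured.records), 1)
        message = captured.records[0].getMessage()
        self.assertIn("(600.0 seconds) expired.", message)
        self.assertIn("Check the slurm_resume log for EC2 error codes.", message)


if __name__ == "__main__":
    unittest.main()
```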
Description of changes
Tests
In slurmctld:
[2025-10-30T12:16:05.106] update_node: node compute-dy-cit-10 reason set to: (Code:InsufficientInstanceCapacity)Failure when resuming nodes - Check the slurm_resume log for EC2 error codes
In clustermgtd:
2025-10-30 12:26:48,861 - [slurm_plugin.clustermgtd:_reset_timeout_expired_compute_resources] - INFO - The following compute resources are in down state due to insufficient capacity: {'compute': {'cit': ComputeResourceFailureEvent(timestamp=datetime.datetime(2025, 10, 30, 12, 16, 42, 904354, tzinfo=datetime.timezone.utc), error_code='InsufficientInstanceCapacity')}}, compute resources will be reset after insufficient capacity timeout (600.0 seconds) expired. Check the slurm_resume log for EC2 error codes.
In slurm_resume:
2025-10-29 13:39:59,768 - 8667 - [slurm_plugin.fleet_manager:_launch_instances] - ERROR - JobID 2 - Error in CreateFleet request (aa9a6ad3-7ac3-4745-a4d1-b8c178907b8b): InvalidParameter - Security group sg-0f2789bcd3e49cdf3 and subnet subnet-0c766771e11dab28c belong to different networks.
Please review the guidelines for contributing and Pull Request Instructions.
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.